1. INTRODUCTION

The public discusses various issues through different media platforms. Social media is one of the
channels where users can share their opinions. However, each platform imposes limitations on the content,
such as the supported media types (video, audio, text, or emoticons), the maximum text length, and the
accepted file formats. These limitations complicate social media analytics. Thus, our research focuses on
opinions in short text messages shared by users through social media channels (especially Twitter).
Decision-makers are mainly interested in the sentiment of the opinions exchanged during social media
conversations, not in the detailed messages of the discussion. Nowadays, public opinion, in the form of
trending topics or virality, can drive and change public policy.

Sentiment analysis is the study of people's opinions, sentiments, and emotions toward something,
as expressed in text messages [1]. The object of the emotions might include products, services, persons, or
other topics of conversation. Sentiment analysis of short texts is challenging because the context of a
message might be present in other short messages or other conversations. Nevertheless, researchers have
studied sentiment analysis using statistical models with lexicon methods [2]-[4], machine learning [5]-[9],
and semi-supervised methods with deep learning [10]-[12]. To obtain good sentiment accuracy, researchers
often add techniques such as data tuning and model tuning. Data tuning uses domain-specific data to train
and tune models for that specific domain. This technique accepts overfitting in exchange for a good result
within that specific domain. On the other hand, model tuning uses various algorithms and combinations to
get a new model that gives results with better accuracy. In deep learning methods, rearranging architectures
can build a model with better results. For example, combining a convolutional neural network (CNN) for its
coarse-grained feature extraction and a recurrent neural network (RNN) for its sequential feature
relationships can give better results [13]. This research explores various architectures of deep neural
networks for sentiment analysis. We use the architectures to build a satisfactory model for a specific domain
using its domain-specific data and evaluate the model performance against another domain with its
domain-specific dataset. For that purpose, we employ two different domains: employment and telecommunication.

A wide range of employment issues is frequently discussed in social media, from compensation and
employee tenure [14], wages and working locations [15], to the labor market and gender issues [16]-[17]. On
the other hand, social media can be used to inform workers about government policy and labor issues, and to
bridge communication between the public, laborers, and policymakers. The sentiment of those interactions
is essential for policymakers. Telecommunication operators are another domain that is frequently discussed
in social media. The discussions can relate to operator performance on user service, bonuses, bandwidth
availability, downtime, and connection quality. Mobile network operators can use the discussions to direct
their business policy and services [18]-[20].

In this research, we explore various deep learning methods and architectures, and combinations of
those methods, to build models for sentiment analysis on short texts in Bahasa Indonesia. We use
domain-specific datasets (on employment and telecommunication issues) for model evaluation and compare the
models' performance on the datasets. This research makes three contributions in evaluating hybrid neural
network methodologies for sentiment analysis on short text messages in Bahasa Indonesia and their
application to different domain datasets: (1) explaining techniques for building models with a hybrid neural
network architecture and a domain-specific dataset, which induces an overfitted model; (2) evaluating the
architecture on a different domain, both with direct model implementation and with model tuning on another
domain-specific dataset; and (3) sharing the experience of migrating a model from one domain to another.

The paper is structured as follows. The first section defines the purpose of the research. The second
section gives short reviews of various sentiment analysis methods, with additional emphasis on deep
neural network techniques and the hybridization of neural network architectures. The third section explains
the research methods and experiments, from data collection and pre-processing to model building using
various techniques for sentiment analysis, so that their results can be compared. The fourth section
gives the experiment results and analysis. The conclusion is given in the last section.


2. HYBRID NEURAL NETWORK ARCHITECTURE FOR SENTIMENT ANALYSIS 

Even though natural language resources for Bahasa Indonesia are limited, researchers have
conducted various studies on sentiment analysis for text in Bahasa Indonesia, using lexicon-based
techniques [21]-[22], machine learning [23]-[24], and deep learning [25]-[27]. Le et al. [25] used LSTM
and CNN to analyze the sentiment of 900 thousand Indonesian tweets. They obtained 73.22% accuracy
with a standard deviation of 1.39 using LSTM without normalization. Franky and Manurung [24] evaluated
several classification techniques (Naive Bayes, Maximum Entropy, and Support Vector Machines) using
unigram and word frequency features on Bahasa Indonesia translations of movie reviews. They achieved
78.82% accuracy, which they considered satisfactory given the simple translation, compared to 80.09%
accuracy when those techniques were applied directly to the original English movie reviews. Problems that
often occur with Indonesian text in social media are unstructured text data and non-standard language. Putra
et al. [28] showed that a hybrid model for sentiment analysis that combines lexicon-based and
maximum entropy methods gives a good evaluation score, with 84.31% accuracy.

The hybrid neural network architecture is a combination of several neural network architectures,
including but not limited to multi-layer perceptron (MLP), CNN, and long short-term memory (LSTM). Figure 1
shows a hybrid neural network architecture that consists of MLP, CNN, and LSTM. The hybrid architecture
makes use of the strengths of its component architectures. The MLP is a simple neural network model that
accommodates the use of the previous model to obtain the classification output. In contrast, the CNN layer
uses its local feature advantages to extract the word feature vector from text sequences. LSTM repetitively
selects or discards feature sequences based on their context.

The feedforward method on MLP multiplies each input neuron with a weight to produce feature
maps as a feature vector and passes it to the next layer through networks of neurons. It carries out the
learning process before obtaining the p1 vector as the output. On the other hand, CNN uses word vectors as
the input dataset and processes them through convolutional layers, pooling layers, and fully connected
layers. At the end of the CNN process, the fully connected layer produces the p2 vector map. The same word
vectors are fed into the LSTM, which extracts the input features using various filters and evaluates the
sequence of features within their context in temporal order. Thus, it produces another feature sequence, p3.
The output vectors p1 from MLP, p2 from CNN, and p3 from LSTM are combined. The simplest
combination function is to concatenate those outputs into the input of the classification process, which
uses the sigmoid activation function.

Figure 1. Hybrid neural network architecture


The basic concept of this hybridization is to concatenate the outputs of the prior processes as the input
to the next one. In this study, we explore three base models: MLP, CNN, and LSTM. We use each
architecture as a baseline neural network model and create a hybrid with the combination
MLP+CNN+LSTM. During the training phase, we store the output of the last layer of each single architecture
as a parameter vector. Those parameter vectors are merged to provide the input of the final process for the
model classification output. We use sigmoid as the activation function for determining the classification
output after merging the parameter vectors, because the sigmoid function gives a real-valued
output between 0 and 1. The sentiment is determined by splitting the output range into a lower and a higher
part: an output value from 0 to 0.5 is considered a negative sentiment, while a value above 0.5 up to 1 is
considered a positive one.
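
The merge-and-classify step can be illustrated with a minimal sketch in plain Python. The branch outputs, the weights, and the function names below are hypothetical and chosen only for illustration; in the actual models the weights are learned during training.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function; maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(p1, p2, p3, weights, bias=0.0, threshold=0.5):
    """Concatenate three branch outputs and apply a single sigmoid unit."""
    merged = list(p1) + list(p2) + list(p3)  # concatenation of branch outputs
    if len(weights) != len(merged):
        raise ValueError("one weight per merged feature is required")
    z = sum(w * x for w, x in zip(weights, merged)) + bias
    score = sigmoid(z)
    # Scores up to the threshold are negative, scores above it are positive.
    label = "positive" if score > threshold else "negative"
    return score, label


# Hypothetical branch outputs (MLP, CNN, LSTM) and weights.
score, label = classify([0.2, 0.1], [0.7], [0.4, 0.9],
                        weights=[1.0, -1.0, 2.0, 0.5, 1.5])
print(round(score, 4), label)
```

A score of exactly 0.5 falls on the negative side, which matches the split described above.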


3. RESEARCH METHOD 

We evaluate the effectiveness of the hybrid neural network approach for sentiment
analysis of short messages in Bahasa Indonesia, especially on employment and mobile telecommunication
topics. The analytics phases are described in Figure 2.

Figure 2. Analytics phases in sentiment evaluation


3.1. Data collection 

The data is obtained from a Twitter collection provided and labeled by Ivosights. The
dataset is classified into positive and negative sentiment. The labeling is done by human annotators who are
native Bahasa Indonesia speakers, because short text messages rarely use standard Bahasa Indonesia. Most
messages are a mixture of Bahasa Indonesia, acronyms, slang, English, Arabic, Javanese or other local
languages, emoticons, emojis, and other symbols. Each annotator gives a sentiment label to each tweet
based on their sense of how they would feel about the tweet's sentence. The classification agreed on by most
annotators determines the final sentiment classification.
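
The majority rule for annotator labels can be sketched as follows. The function name is hypothetical, and the handling of ties (returning no label) is an assumption, since the text does not say how ties were resolved.

```python
from collections import Counter


def majority_label(annotations):
    """Return the label chosen by most annotators, or None if there is a tie."""
    counts = Counter(annotations).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # assumption: a tie yields no final label
    return counts[0][0]


print(majority_label(["positive", "negative", "positive"]))
```

Using an odd number of annotators per tweet avoids ties for a two-class labeling task.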


3.2. Data extraction 

We take two different topics and characteristics from the Twitter dataset, which relate to
employment and telecommunication operators' issues. There are 67,106 tweets in Bahasa Indonesia related to
employment issues, covering topics ranging from welfare to social security to salary, posted from January
2017 to November 2017. Of those, 30,102 tweets have positive sentiment and 37,004 tweets have negative
sentiment. On the other hand, there are 81,159 tweets in Bahasa Indonesia about telecommunication
operators' issues, posted from January 2017 to May 2018, of which 40,938 tweets are labeled as positive
and 40,221 tweets as negative. We select Bahasa Indonesia tweet messages only, delete
tweet messages in local, foreign, or slang language, and split the dataset into 80% training data and 20%
test data.
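
The 80/20 split can be sketched in a few lines of Python. Shuffling before the split and the fixed seed are assumptions made for reproducibility; the text only states the proportions.

```python
import random


def train_test_split(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and cut it into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # assumption: random split with a fixed seed
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in record ids for the 67,106 employment tweets.
train, test = train_test_split(range(67106))
print(len(train), len(test))
```

With these proportions, the 67,106 employment tweets yield 53,684 training and 13,422 test records, and the 81,159 telecommunication tweets yield 64,927 and 16,232.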


3.3. Dataset pre-processing 

Data pre-processing is carried out to prepare a clean dataset for further processing. We clean the text
messages by filtering out unnecessary characters, such as punctuation characters, ASCII codes, and tokens
with non-alphabetic characters. Each sentence is then tokenized. Each token that is not one of Bahasa
Indonesia's stop words [29] is called a vocabulary and is added to a dictionary. We group tokens from the word
dictionary by their sentiment into vocabulary files representing words from every record with similar
sentiment polarity. There are 14,075 tokens with 27,736 vocabulary entries in the employment issues dataset
and 13,994 tokens with 36,264 vocabulary entries in the telecommunication issues dataset. The Indonesian
stop word list is beneficial for sorting out words that comply with Indonesian language requirements,
including five Twitter features (mention, hashtag, URL, discourse marker, and emoticon). The pre-processing
generates a data dictionary for each dataset. Figure 3 shows the top ten vocabularies in employment issues (a)
and telecommunication issues (b).
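
The cleaning, tokenization, and per-sentiment vocabulary steps can be sketched as below. The stop-word set is a tiny hypothetical sample, not the list from [29], and the regular expressions and function names are illustrative assumptions rather than the exact filters used in the experiments.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word sample; the experiments use the list from [29].
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}


def clean_and_tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, drop URLs/mentions/hashtags, keep alphabetic non-stop-word tokens."""
    text = re.sub(r"https?://\S+", " ", text.lower())  # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)               # remove mentions and hashtags
    tokens = [t.strip(".,!?;:\"'()[]") for t in text.split()]
    return [t for t in tokens if t.isalpha() and t not in stop_words]


def build_vocabularies(labeled_tweets):
    """Count tokens separately for each sentiment label."""
    vocab = {"positive": Counter(), "negative": Counter()}
    for text, label in labeled_tweets:
        vocab[label].update(clean_and_tokenize(text))
    return vocab


print(clean_and_tokenize("Gaji naik di 2017! @user #upah http://t.co/x"))
```

Calling `most_common(10)` on one of the resulting counters gives a top-ten vocabulary list of the kind plotted in Figure 3.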


3.4. Hybrid neural network model development for sentiment analysis 

We build several models with CNN, MLP, LSTM, and hybrids of those three architectures.
First, we build a text representation where each word is represented as a vector of real numbers using word
embedding. A real-number representation is required because a neural network model can only accept
numerical input values. The word embedding technique is suitable for neural network models because it
keeps the order and interaction of the words within sentences and the probability functions of each word
sequence. Thus, it can be more expressive than classical models such as bag-of-words, bigram, or trigram
[30]. The next step is model building; we use CNN, MLP, LSTM, and hybrids of those models for
document classification. The CNN is configured with a number of filters that are processed in parallel, a
kernel size, and an activation function. The output of this stage is a two-dimensional vector that represents
the extracted features. The LSTM units consist of cells, input gates, output gates, and forget gates. The MLP
serves as a back-end model for feature interpretation. The output layer uses an activation function with
values between 0 for negative sentiment and 1 for positive sentiment. After that, we search for the training
models that produce the maximum accuracy using gradient descent. Changes in the topology of the
architecture are required to find the optimal configuration that minimizes the error. The last stage is
storing the best generated model for evaluation.

Figure 3. Top vocabularies, (a) employment issues, (b) telecommunication issues


3.5. Model evaluation 

We evaluate the model generated by the training process with the testing dataset. The same
estimation functions are used for evaluating the training and testing datasets, and new data is
encoded with the same encoding scheme that was used for the training data.
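
The two reported quantities, loss error and accuracy, can be computed from the sigmoid outputs as sketched below. The text does not name the loss function; binary cross-entropy is assumed here because it is the usual choice for a sigmoid output, and the sample probabilities are hypothetical.

```python
import math


def evaluate(probabilities, labels, threshold=0.5, eps=1e-7):
    """Return (mean binary cross-entropy, accuracy) for sigmoid outputs."""
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("probabilities and labels must be non-empty and equal in length")
    loss = 0.0
    correct = 0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        loss += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        correct += int((p > threshold) == bool(y))
    n = len(labels)
    return loss / n, correct / n


# Hypothetical model outputs and gold labels (1 = positive, 0 = negative).
loss, accuracy = evaluate([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])
print(round(loss, 4), accuracy)
```

Applying the same function to the training and testing predictions keeps the two sets of numbers directly comparable.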


4. RESULTS AND DISCUSSION 
4.1. Experiment results 

Raw Twitter data on the employment and telecommunication issues are collected offline and stored in
JSON file format. The data contains mentions, hashtags, URLs, discourse markers, and emoticons, which need
to be filtered because they could lead to unexpected analysis results. Thus, the documents are extracted and
converted into text files for the pre-processing stage. The cleaned dataset is modified to provide the
features needed for sentiment analysis. Pre-processing and cleaning the documents of retweets, duplicate
posts, and spam messages returns 89,373 messages on the telecommunication topic and 75,126 tweets on the
employment issues topic. Table 1 shows the number of collected tweets, the number of tweets for each
sentiment, and the number of cleaned tweets prepared after data extraction and pre-processing.


Table 1. Dataset detail

Dataset name               Properties            Number
Employment issues          Number of data        124,951
                           Cleaned data           72,126
                           Positive sentiment     30,102
                           Negative sentiment     37,044
                           Neutral sentiment      67,022
                           Vocabularies           27,736
Telecommunication issues   Number of data        299,997
                           Cleaned data           89,373
                           Positive sentiment     40,938
                           Negative sentiment     40,221
                           Neutral sentiment     218,838
                           Vocabularies           36,264





The training dataset is prepared by grouping the files, with positive-sentiment files labeled as
class 1 and negative-sentiment files labeled as class 0. Training starts with tokenization and word
embedding to build vectors of real numbers as the word representation. Tokenization represents each
document as a sequence of integers. Each integer maps a single token to its real-valued vector
representation. The vector values are randomly initialized at the start of training, and an embedding
layer's API can be used to create this initialization for all document datasets. Since the inputs for the
training process must have the same vector size, the token sequences used as inputs are kept in order and
padded with 0s if the number of tokens is less than the defined vector size.
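
Integer encoding and zero padding can be sketched without any library. The helper names are hypothetical, and two details are assumptions: unknown tokens are skipped, and the zeros are appended at the end of the sequence (some libraries pad at the front by default).

```python
def fit_index(token_lists):
    """Assign consecutive integers to tokens; 0 is reserved for padding."""
    index = {}
    for tokens in token_lists:
        for tok in tokens:
            index.setdefault(tok, len(index) + 1)
    return index


def encode_and_pad(tokens, index, max_length):
    """Map tokens to integers, truncate to max_length, and pad with 0s."""
    ids = [index[t] for t in tokens if t in index]  # assumption: skip unknown tokens
    ids = ids[:max_length]
    return ids + [0] * (max_length - len(ids))


training_docs = [["gaji", "naik"], ["sinyal", "lemot", "gaji"]]
index = fit_index(training_docs)
print(encode_and_pad(["gaji", "lemot", "baru"], index, 5))
```

The index is fitted on the training documents only and then reused unchanged for the test documents, so both are encoded with the same scheme.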

Each word of each document is represented as a vector of real numbers using word embedding. The
vectors are passed through various deep neural network architectures for document classification in
sentiment analysis. The CNN architecture consists of filters, a kernel size definition, and a pooling layer
that simplifies the output for classification. The LSTM architecture produces its output after passing
through the forget, input, cell, and output gates. Furthermore, the MLP classifies after passing through the
input layer, several hidden layers, and the output layer. The hybrid architectures are designed by appending
one architecture to another, such as CNN and MLP, LSTM and MLP, and CNN, LSTM, and MLP.

Figure 4 (a) shows the performance of the implemented neural network architectures in reaching the
minimum loss error in the training stage. It shows that the hybrid (CNN+LSTM+MLP) outperforms the
other models with the lowest error. This result holds for both input datasets. The LSTM-only architecture
is the second-best, which shows that LSTM can find context well enough for sentiment analysis. The MLP
shows the worst performance, generating a significant loss error, because it uses a feed-forward algorithm
with a one-time learning process for each of its layers. The training accuracy shown in Figure 4 (b) affirms
the training loss error result: the hybrid (CNN+LSTM+MLP) architecture again has the best
performance for both training datasets.

Figure 4. Training evaluation, (a) loss error, (b) accuracy

In the testing phase, we evaluate the loss error of the experimental results. Figure 5 (a) shows a
significant difference between the loss error on the employment issues dataset and on the
telecommunication issues dataset. However, the main concern in this study is the performance difference
between architectures. The single CNN and the hybrid (CNN+LSTM+MLP) architecture show
the best performance, possibly because CNN is good at extracting features from the text in sentences.
In the testing phase, the trained model is tested against 20% of the dataset. The result shows that the
hybrid architecture has a near-minimum loss error compared to the single architectures.
Figure 5 (b) shows the accuracy of the various architectures on the test datasets. It shows that the hybrid
(CNN+LSTM+MLP) architecture gives the best accuracy on both the employment and telecommunication datasets.

Figure 5. Testing evaluation, (a) loss error, (b) accuracy


4.2. Discussion 

The previous section describes the results of our experiments with seven different neural
network architectures (MLP, CNN, LSTM, CNN+MLP, CNN+LSTM, LSTM+MLP, and
CNN+LSTM+MLP) for sentiment analysis of short messages in Bahasa Indonesia in the employment and
telecommunication domains. The MLP architecture is tested with one and with two hidden layers. It takes
input from the dataset according to the maximum length defined at the pre-processing stage. The best result
of the learning process with MLP is obtained with two hidden layers. It gives training loss errors of 0.6335
and 0.3797 and training accuracies of 71.47% and 92.34% for the employment and telecommunication issues
datasets, respectively. However, the accuracy drops on both testing datasets: the employment issues test
dataset gives a loss error of 1.1275 and 57.90% accuracy, and the telecommunication test dataset gives a
loss error of 0.1068 and 85.92% accuracy.

The CNN architecture is more reliable in extracting features from sentences that are stored in the
"negative" and "positive" sentiment directories. The loss error and accuracy in the training phase are better
than those of MLP, with training loss errors of 0.1870 and 0.2227 and training accuracies of 97.53% and
97.63% for the employment and telecommunication issues datasets. The values are not consistent in the
testing phase. For the employment issues dataset, they worsen to a testing loss error of 0.4446 and a
testing accuracy of 81.71%. In contrast, for the telecommunication issues dataset, they improve to a testing
loss error of 0.0201 and a testing accuracy of 99.45%. This might be because the telecommunication issues
dataset has more records and richer vocabularies than the employment issues dataset.

The LSTM architecture is more context-oriented in encoding the input sentence. In this experiment,
we add a dropout function before the output process, which drops part of the LSTM output to
reduce overfitting and improve accuracy, and set its value to a maximum of 0.5. The experiment shows that
this architecture performs better than MLP or CNN, especially on a larger dataset (such as the
telecommunication issues dataset). The training phase gives low loss errors (0.0581 and 0.0387) and good
accuracy (97.30% and 98.15%) for the employment and telecommunication issues datasets. However, similar to
the CNN, the testing results worsen on the employment issues dataset, to a testing loss error of 0.9353 and
a testing accuracy of 65.60%, and improve on the telecommunication issues dataset, to a testing loss error
of 0.0133 and a testing accuracy of 99.77%.

The hybrid LSTM and MLP architecture is built by concatenation, where the results of the LSTM and
MLP processes are combined to obtain a single output. The final sentiment classification is
determined using the sigmoid activation function. The experiment shows that appending MLP does not
improve the architecture's performance: the loss error and accuracy in the training and evaluation
phases are somewhat worse than those of the LSTM-only architecture. The training loss error of the hybrid
LSTM and MLP architecture is 0.2135 and 0.2608 on the employment and telecommunication issues datasets,
while the training accuracy is 96.35% and 96.31%, respectively. As in the other experiments, the results on
the employment issues test dataset are worse than the training ones, with a loss error of 0.7832 and an
accuracy of 69.00%. On the other hand, the results on the telecommunication issues test dataset are better,
with a loss error of 0.0335 and an accuracy of 99.45%.

The hybrid (CNN and MLP) architecture is created by attaching MLP at the end of the feature
extraction in CNN, before executing the flatten function that bridges the CNN process with the output
process, and using a pool size of two in the max pooling process. The results are quite good and better than
hybrid LSTM and MLP in the training stage. In the testing stage, it is slightly better, with comparable
average values. The loss error values in the training stage are 0.4440 and 0.2620 for the employment and
telecommunication issues datasets, while the accuracy is 94.22% and 96.61%, respectively. In the testing
evaluation, the loss error is 0.5855 and 0.0107 and the accuracy is 82.24% and 99.70% for the employment and
telecommunication issues datasets. The accuracy on the employment issues test dataset is slightly better
than that of the single CNN or the other dual hybrid architectures.

The hybrid CNN and LSTM architecture combines feature extraction with context orientation. At the
end of the fixed output, it uses the sigmoid activation function to return the sentiment classification. The
performance is quite good: the training loss error is 0.2199 and 0.2600 for the employment and
telecommunication issues datasets, while the accuracy is 95.91% and 96.41%, respectively. The architecture
still has difficulties with the employment issues testing dataset, where the loss error is 0.6830 and the
accuracy is 74.60%. In contrast, the telecommunication issues testing dataset is evaluated with better
performance, with a loss error of 0.6017 and an accuracy of 98.50%.

The hybrid (CNN, LSTM, and MLP) architecture is built by concatenating the CNN processes with the
LSTM processes and the MLP processes. This integration makes the learning process more complicated, which
results in a longer training computation time. However, it gives significant benefits for the architecture's
performance in sentiment classification. The training loss errors are 0.0490 and 0.0347 for the employment
and telecommunication issues datasets, and the training accuracies are 98.85% and 98.39%, respectively. The
architecture also shows its best performance on the testing datasets of the employment and the
telecommunication issues, with loss errors of 0.4850 and 0.0020 and accuracies of 85.48% and 99.85%.

The performance of the implemented architectures for sentiment classification is summarized in Table 2
for the training datasets of employment and telecommunication issues and in Table 3 for the
respective testing datasets. We expect the optimal architecture to give the lowest loss error and the highest
accuracy. The hybrid (CNN+LSTM+MLP) shows its superiority over the other architectures by giving the
lowest loss error and the highest accuracy on the training and testing datasets, with one exception: on the
employment issues testing dataset, the CNN architecture attains the lowest loss error, and the
hybrid (CNN+LSTM+MLP) is the second-best.
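
The ranking for the employment issues testing dataset can be re-derived from the values reported in this section. The sketch below only re-sorts those reported numbers; the dictionary layout and the function name are illustrative.

```python
# Testing results on the employment issues dataset, as reported in this section:
# architecture -> (loss error, accuracy in %)
employment_test = {
    "MLP": (1.1275, 57.90),
    "CNN": (0.4446, 81.71),
    "LSTM": (0.9353, 65.60),
    "LSTM+MLP": (0.7832, 69.00),
    "CNN+MLP": (0.5855, 82.24),
    "CNN+LSTM": (0.6830, 74.60),
    "CNN+LSTM+MLP": (0.4850, 85.48),
}


def summarize(results):
    """Return the architectures with extreme loss error and accuracy."""
    by_loss = sorted(results, key=lambda name: results[name][0])
    by_acc = sorted(results, key=lambda name: results[name][1])
    return {
        "min_error": by_loss[0],
        "second_lowest_error": by_loss[1],
        "max_error": by_loss[-1],
        "min_accuracy": by_acc[0],
        "max_accuracy": by_acc[-1],
    }


summary = summarize(employment_test)
for key, name in summary.items():
    print(key, name)
```

The output agrees with the employment column of the testing summary: CNN has the lowest loss error, the hybrid is second on loss error and first on accuracy, and MLP is last on both.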








Table 2. Training performance for each model on each input domain dataset

Parameters       Employment dataset   Telecommunication dataset
Max. Error       MLP                  MLP
Min. Error       CNN+LSTM+MLP         CNN+LSTM+MLP
Min. Accuracy    MLP                  MLP
Max. Accuracy    CNN+LSTM+MLP         CNN+LSTM+MLP

Table 3. Testing performance for each model on each input domain dataset

Parameters       Employment dataset   Telecommunication dataset
Max. Error       MLP                  MLP
Min. Error       CNN                  CNN+LSTM+MLP
Min. Accuracy    MLP                  MLP
Max. Accuracy    CNN+LSTM+MLP         CNN+LSTM+MLP





Figure 6 shows the model plot of the hybrid (CNN+LSTM+MLP) architecture. The input for each
model contains vocabulary tokens with a maximum length of 962 words. The CNN branch converts the
input with word embedding by mapping each word into a 100-dimensional vector. The convolutional process
extracts features from the input data with 32 filters and a kernel size of four. To reduce overfitting and
the amount of computation, we use a dropout of 0.5. The CNN branch stops at the flatten
function because its result will be merged with the results from the other models. Similar to the CNN branch,
the LSTM branch starts by applying word embedding to the input data. We use a neural network layer with ten
nodes and the ReLU activation function for context-based processing to generate a ten-dimensional output
vector, with a dropout of 0.2. The MLP branch uses three hidden layers that consist of 20, 10, and 10 nodes
with varying input dimensions. The outputs of CNN, LSTM, and MLP are merged by concatenation. We
process the merged output using a feedforward neural network with a single hidden layer of ten nodes and
predict the output using a sigmoid function to give a positive or negative sentiment classification.

Figure 6. Hybrid model diagram
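
The width of the merged vector follows from the layer sizes by simple arithmetic, sketched below. Several values are assumptions rather than facts from the text: the convolution is taken to use 32 filters with a kernel size of four, no padding, and a stride of one, and a max pooling step with a pool size of two is borrowed from the CNN and MLP hybrid description.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1-D convolution without padding."""
    return (input_length - kernel_size) // stride + 1


def pooled_length(length, pool_size):
    """Output length of non-overlapping max pooling."""
    return length // pool_size


MAX_LENGTH = 962     # maximum input length, from the text
FILTERS = 32         # assumption: number of convolution filters
KERNEL_SIZE = 4      # assumption: convolution kernel size
POOL_SIZE = 2        # assumption: pool size taken from the CNN+MLP hybrid
LSTM_UNITS = 10      # LSTM output dimension, from the text
MLP_LAST_LAYER = 10  # last of the 20-10-10 hidden layers, from the text

conv_len = conv1d_output_length(MAX_LENGTH, KERNEL_SIZE)
pool_len = pooled_length(conv_len, POOL_SIZE)
cnn_features = pool_len * FILTERS  # size of the flattened CNN output
merged_width = cnn_features + LSTM_UNITS + MLP_LAST_LAYER

print(conv_len, pool_len, cnn_features, merged_width)
```

Under these assumptions the flattened CNN output dominates the merged vector, so the ten-node hidden layer after the concatenation receives most of its inputs from the CNN branch.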














The results show that the combination of CNN, LSTM, and MLP architectures gives the best results and
can be used on a domain dataset that differs from its training domain. This is because the hybrid
architecture gets the best of its component architectures: the MLP accommodates the use of the previous
model to obtain the classification output, the CNN layer extracts the word feature vector from text
sequences, and the LSTM repetitively selects or discards feature sequences based on their context. Those
advantages are useful across different domain datasets, where one architecture might be better than the
others, but the combination performs consistently well. The experiments on sentiment analysis of short text
in Bahasa Indonesia show that hybrid models can obtain better performance, and that the same architecture
can be directly used on another domain-specific dataset.


5. CONCLUSION 

Deep learning is a valuable method for developing sentiment analysis on short text messages,
especially for low-resource languages, because of its ability to infer and extract hidden information from a
large amount of data. The CNN, LSTM, and MLP architectures have shown their ability to solve classification
problems in natural language processing with fairly good results. This research confirms that a hybrid
architecture (CNN+LSTM+MLP) can give an even better result and that the architecture can be applied directly
to different domain datasets, because the hybrid architecture can utilize the advantages of each of its
components.